- All right, welcome to lecture nine. So today we will be talking about CNN architectures. And just a few administrative points before we get started: assignment two is due Thursday. The midterm will be in class on Tuesday, May ninth, so next week, and it will cover material through this coming Thursday, May fourth. So everything up to recurrent neural networks is going to be fair game. For the poster session we've decided on a time: it's going to be Tuesday, June sixth, from twelve to three p.m., so the last week of classes. We have our poster session a little bit early, during the last week, so that once you guys get feedback you still have some time to work on your final report, which will be due finals week.

Okay, so just a quick review of last time. Last time we talked about different kinds of deep learning frameworks. We talked about PyTorch, TensorFlow, Caffe2, and we saw that using these kinds of frameworks we were able to easily build big computational graphs, for example very large neural networks and convnets, and to really easily compute gradients in these graphs.
So we can compute all of the gradients for all the intermediate variables, weights, and inputs, use that to train our models, and run all of this efficiently on GPUs. And we saw that for a lot of these frameworks the way this works is with the modularized layers that you guys have been writing in your homeworks as well, where we have a forward pass and a backward pass, and then in our final model architecture all we need to do is define the sequence of layers together. So using that we're able to very easily build up very complex network architectures.

So today we're going to talk about some specific kinds of CNN architectures that are used today in cutting-edge applications and research. We'll go into depth on some of the most commonly used architectures, the winners of the ImageNet classification benchmarks: in chronological order, AlexNet, VGGNet, GoogLeNet, and ResNet. We'll cover these in a lot of depth, and then after that I'll briefly go through some other architectures that are not as prominently used these days, but are interesting either from a historical perspective or as recent areas of research.

Okay, so just a quick review. We talked a long time ago about LeNet, which was one of the first instantiations of a convnet that was successfully used in practice.
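The modular forward/backward idea above can be sketched in a few lines of plain Python. This is a toy illustration of the pattern, not any particular framework's actual API; the layer names and the scalar inputs are made up for the example.

```python
# Minimal sketch of the modular-layer pattern: each layer defines a forward
# pass and a backward pass, and a model is just a sequence of such layers.

class Multiply:
    """Multiplies the input by a fixed weight w (a stand-in for a real layer)."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x                 # cache the input for the backward pass
        return self.w * x
    def backward(self, dout):
        self.dw = dout * self.x    # gradient w.r.t. the weight
        return dout * self.w       # gradient w.r.t. the input

class ReLU:
    def forward(self, x):
        self.x = x
        return max(0.0, x)
    def backward(self, dout):
        return dout if self.x > 0 else 0.0

class Sequential:
    """Chains layers: forward in order, backward in reverse (chain rule)."""
    def __init__(self, layers):
        self.layers = layers
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
    def backward(self, dout):
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

model = Sequential([Multiply(2.0), ReLU(), Multiply(-3.0)])
out = model.forward(1.5)    # 1.5 * 2 = 3.0, ReLU -> 3.0, * -3 = -9.0
dx = model.backward(1.0)    # chain rule back through every layer: -3 * 1 * 2 = -6
print(out, dx)              # -9.0 -6.0
```

Real frameworks do the same thing with tensors instead of scalars, and build the backward pass automatically from the graph.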
And so this was the convnet that took an input image, used conv filters (five by five filters applied at stride one), and had a couple of conv layers, a few pooling layers, and then some fully connected layers at the end. And this fairly simple convnet was very successfully applied to digit recognition.

So AlexNet, from 2012, which you guys have also heard about before in previous classes, was the first large-scale convolutional neural network that was able to do well on the ImageNet classification task. In 2012 AlexNet was entered in the competition and was able to outperform all previous non-deep-learning-based models by a significant margin, and so this was the convnet that started the spree of convnet research and usage afterwards. The basic AlexNet architecture is a conv layer followed by a pooling layer and normalization (conv, pool, norm), then a few more conv layers, a pooling layer, and then several fully connected layers afterwards. So this actually looks very similar to the LeNet network that we just saw; there are just more layers in total. There are five conv layers and two fully connected layers before the final fully connected layer going to the output classes.

So let's first get a sense of the sizes involved in AlexNet. If we look at the input to AlexNet, it was trained on ImageNet, with input images of size 227 by 227 by 3.
And if we look at this first layer, which is a conv layer, for AlexNet it's 11 by 11 filters, 96 of them applied at stride 4. So let's just think about this for a moment: what's the output volume size of this first layer? And there's a hint. So remember we have our input size, we have our convolutional filters, right, and we have this formula, which is the hint over here, that gives you the size of the output dimensions after applying a conv: the full image size, minus the filter size, divided by the stride, plus one. So given that that's written up here for you, does anyone have a guess at what's the final output size after this conv layer?

[student speaks off mic]

- So I heard 55 by 55 by 96, yep, that's correct. Right, so our spatial dimensions at the output are going to be 55 in each dimension, and then we have 96 total filters, so the depth after our conv layer is going to be 96. So that's the output volume. And what's the total number of parameters in this layer? Remember, we have 96 11 by 11 filters.

[student speaks off mic]

- 96 by 11 by 11, almost. So yes, someone added another "by three", and yes, that's correct. Each of the filters is going to look at a local region of 11 by 11 by 3, right, because the input depth was three. And so that's each filter's size, times the 96 of these we have in total. And so there are about 35K parameters in this first layer.
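The arithmetic for this first layer can be checked directly with the formula from the hint. This is just a sketch of the slide's numbers; the parameter count here covers weights only (biases are ignored, as in the "35K" figure quoted above).

```python
def conv_output_size(n, f, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    return (n - f) // stride + 1

# AlexNet conv1: 227x227x3 input, 96 filters of size 11x11, stride 4
out = conv_output_size(227, 11, 4)
print(out)                    # 55, so the output volume is 55 x 55 x 96

# Each filter sees an 11 x 11 x 3 region, and there are 96 filters
params = 96 * (11 * 11 * 3)
print(params)                 # 34848 -- the "35K" mentioned above
```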
Okay, so now if we look at the second layer, this is a pooling layer, and in this case we have three by three filters applied at stride two. So what's the output volume of this layer after pooling? And again we have a hint, very similar to the last question.

Okay, 27 by 27 by 96. Yes, that's correct. Right, so the pooling layer is basically going to use this same formula that we had here, because the pooling is applied at a stride of two. We're going to use the same formula to determine the spatial dimensions, so the spatial dimensions are going to be 27 by 27, and pooling preserves the depth. We had 96 as the input depth, and it's still going to be 96 at the output.

And next question: what's the number of parameters in this layer?

I hear some muttering. [student answers off mic]

- Nothing, okay. Yes, so a pooling layer has no parameters, so, kind of a trick question.

Okay, so we can basically, yes, question?

[student speaks off mic]

- The question is, why are there no parameters in the pooling layer? The parameters are the weights, right, that we're trying to learn. Convolutional layers have weights that we learn, but for pooling all we have is a rule: we look at the pooling region and we take the max. So there are no parameters that are learned.
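The pooling step uses the same size formula as the conv layer; a quick sketch of the slide's numbers:

```python
def output_size(n, f, stride):
    # Same (N - F) / stride + 1 formula, reused for pooling
    return (n - f) // stride + 1

# AlexNet pool1: 55x55x96 input, 3x3 pooling at stride 2
print(output_size(55, 3, 2))   # 27 -> output volume 27 x 27 x 96
# Depth is preserved (96 in, 96 out), and pooling has 0 parameters.
```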
So we can keep doing this; you can just repeat the process, and it's a good exercise to go through this and figure out the sizes and the parameters at every layer. If you do this all the way through, this is the final architecture you end up with. There are 11 by 11 filters at the beginning, then five by five and some three by three filters, so these are generally pretty familiar-looking sizes that you've seen before. Then at the end we have a couple of fully connected layers of size 4096, and finally the last layer is FC8, going to the softmax, which goes to the 1000 ImageNet classes.

And just a couple of details about this: it was the first use of the ReLU non-linearity, which we've talked about and which is the most commonly used non-linearity. They used local response normalization layers, basically trying to normalize the response across neighboring channels, but this is something that's not really used anymore; other people later showed that it didn't have much of an effect. There's a lot of heavy data augmentation, and you can look in the paper for more details, but things like flipping, jittering, and color normalization, all of these things, which you'll probably find useful when you're working on your projects, for example. So, a lot of data augmentation here.
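The "figure out the sizes and parameters at every layer" exercise can be automated with a small loop over layer specs. Only the first two AlexNet layers (the ones worked out above) are filled in here; the remaining specs from the slide can be added the same way.

```python
# Walk a list of conv/pool specs and report the output volume and parameter
# count after each layer, using the (N - F) / stride + 1 formula.

def walk(layers, size, depth):
    for kind, f, stride, num_filters in layers:
        size = (size - f) // stride + 1
        if kind == "conv":
            params = num_filters * f * f * depth   # weights only, no biases
            depth = num_filters
        else:                                      # pooling: no parameters,
            params = 0                             # depth preserved
        print(f"{kind}: {size}x{size}x{depth}, {params} params")
    return size, depth

alexnet_start = [
    ("conv", 11, 4, 96),   # CONV1: 11x11, 96 filters, stride 4
    ("pool", 3, 2, None),  # POOL1: 3x3, stride 2
]
walk(alexnet_start, 227, 3)
# conv: 55x55x96, 34848 params
# pool: 27x27x96, 0 params
```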
They also used dropout and a batch size of 128, and trained with SGD with momentum, which we talked about in an earlier lecture. They basically started with a base learning rate of 1e-2, and every time it plateaued, reduced it by a factor of 10, and kept going until they finished training. There's a little bit of weight decay, and in the end, in order to get the best numbers, they also did an ensemble of models: training multiple of these and averaging them together, which also gives an improvement in performance.

One other thing I want to point out is that if you look at this AlexNet diagram up here, it looks kind of like the normal convnet diagrams that we've been seeing, except for one difference, which is that you can see it's kind of split into these two different rows, or columns, going across. The reason for this is mostly a historical note: AlexNet was trained on GTX 580 GPUs, older GPUs that only had three gigs of memory, so the entire network couldn't actually fit on one. What they ended up doing was spreading the network across two GPUs, so on each GPU you would have half of the neurons, or half of the feature maps.
And so for example if you look at this first conv layer, we have a 55 by 55 by 96 output, but if you look at this diagram carefully (you can zoom in later in the actual paper), you can see that it's actually only 48 depth-wise on each GPU; they just split the feature maps directly in half. What happens is that for most of these layers, for example conv one, two, four, and five, the connections are only with feature maps on the same GPU. So you would take as input half of the feature maps, the ones that were on the same GPU before, and you don't look at the full 96 feature maps; you just take as input the 48 in that first layer. And then there are a few layers, conv three as well as FC six, seven, and eight, where the GPUs do talk to each other, so there are connections with all feature maps in the preceding layer. There's communication across the GPUs, and each of these neurons is then connected to the full depth of the previous input layer.

Question?

- [Student] It says the full simplified AlexNet architecture. [mumbles]

- Oh okay, so the question is why does it say "full (simplified) AlexNet architecture" here? It just says that because I didn't put all the details on here. This is the full set of layers in the architecture, and the strides and so on, but, for example, the normalization layers and other details are not written on here.
And then just one little note: if you look at the paper and try to write out the math and architectures and so on, there's a little bit of an issue on the very first layer. If you look in the figure they'll say 224 by 224, but there's actually something funny going on with the padding, and the numbers actually work out if you treat it as 227.

AlexNet was the winner of the ImageNet classification benchmark in 2012; you can see that it cut the error rate by quite a large margin. It was the first CNN-based winner, and it was widely used as a base architecture almost ubiquitously from then until a couple of years ago. It's still used quite a bit; it's used in transfer learning for lots of different tasks, and so it was used for basically a long time and was very famous. Now, though, there have been some more recent architectures that have generally just had better performance, so we'll talk about these next, and these are going to be the more common architectures that you'll want to use in practice.

So just quickly, first: in 2013 the ImageNet challenge was won by something called ZFNet. Yes, question?

[student speaks off mic]

- So the question is, is there intuition for why AlexNet was so much better than the approaches that came before? Deep learning convnets [mumbles], this is just a very different kind of approach and architecture. This was the first deep-learning-based approach, the first convnet that was used.
So in 2013 the challenge was won by something called ZFNet (Zeiler-Fergus Net), named after its creators. This mostly improved the hyperparameters over AlexNet. It had the same number of layers and the same general structure, and they made a few changes, things like changing the stride size and the numbers of filters, and after playing around with these hyperparameters more, they were able to improve the error rate. But it's still basically the same idea.

So in 2014 there were a couple of architectures that were more significantly different and made another jump in performance, and the main difference with these networks, first of all, was that they were much deeper. From the eight-layer networks of 2012 and 2013, now in 2014 we had two very close winners at around 19 layers and 22 layers, so significantly deeper. The winner was GoogLeNet, from Google, but very close behind was something called VGGNet, from Oxford, and actually on the localization challenge VGG got first place, as well as in some of the other tracks. So these were both very, very strong networks.

So let's first look at VGG in a little more detail. The VGG network is the idea of much deeper networks with much smaller filters. They increased the number of layers from the eight layers in AlexNet to models with 16 to 19 layers in VGGNet.
And one key thing they did was keep very small filters, only three by three convs all the way through, which is basically the smallest conv filter size that still looks at a little bit of the neighboring pixels. They just kept this very simple structure of three by three convs with periodic pooling all the way through the network. It's a very simple, elegant network architecture, and it was able to get 7.3% top-five error on the ImageNet challenge.

So first, the question of why use smaller filters. When we take these small filters, we have fewer parameters, and we try to stack more of them: instead of having larger filters, we have smaller filters with more depth, more of these layers. What happens is that you end up having the same effective receptive field as if you only had one seven by seven convolutional layer. So here's a question: what is the effective receptive field of three of these three by three conv layers with stride one? If you were to stack three three by three conv layers with stride one, what's the effective receptive field, that is, the total spatial area of the input that a neuron at the top of the three layers is looking at?

So I heard fifteen pixels; why fifteen pixels?

[student answers off mic]

- Okay, so the reason given was because they overlap. So that's on the right track.
What's actually happening, though, is that at the first layer the receptive field is going to be three by three, right? Then at the second layer, each neuron is going to look at a three by three region of first-layer outputs, but the corners of that three by three region see an additional pixel on each side in the original input. So the second layer is actually looking at a five by five receptive field. And if you do this again, the third layer is looking at three by three in the second layer, but if you just draw out this pyramid, it's looking at seven by seven in the input layer. So the effective receptive field here is going to be seven by seven, which is the same as one seven by seven conv layer. So this has the same effective receptive field as a seven by seven conv layer, but it's deeper: it's able to have more non-linearities in there, and it also has fewer parameters. If you look at the total number of parameters, each of these three by three conv filters is going to have nine parameters, three times three, and then times the input depth, so three times three times C, times the total number of output feature maps, which is again C, since we're preserving the total number of channels.
So you get three times three times C times C for each of these layers, and we have three layers, so it's going to be three times this number, compared to a single seven by seven layer, where by the same reasoning you get seven squared times C squared. So you have fewer parameters in total, which is nice.

So now if we look at this full network, there are a lot of numbers up here that you can go back and look at more carefully, but if we work out all of the sizes and numbers of parameters the same way we calculated the example for AlexNet (this is a good exercise to go through), we can see that, going the same way, we have a couple of these conv layers and a pooling layer, a couple more conv layers, a pooling layer, several more conv layers, and so on, and this just keeps going up. If you count the total number of convolutional and fully connected layers, we're going to have 16 in this case for VGG16, and then VGG19 is just a very similar architecture, but with a few more conv layers in there.
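Both claims here are easy to verify numerically. The sketch below grows the receptive field layer by layer and compares the parameter counts; the channel count C = 64 is an arbitrary example value, and biases are ignored as in the lecture's count.

```python
# Effective receptive field of n stacked 3x3, stride-1 conv layers:
# each extra layer adds (F - 1) pixels of context, one on every side.
def receptive_field(n_layers, f=3):
    rf = 1
    for _ in range(n_layers):
        rf += f - 1
    return rf

print(receptive_field(1), receptive_field(2), receptive_field(3))  # 3 5 7

# Parameter comparison at C input/output channels (weights only):
C = 64                               # example channel count
stacked = 3 * (3 * 3 * C * C)        # three stacked 3x3 conv layers: 27*C^2
single = 7 * 7 * C * C               # one 7x7 conv layer: 49*C^2
print(stacked, single)               # 110592 200704
```

Same seven by seven receptive field, roughly half the parameters, plus two extra non-linearities along the way.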
And so the total memory usage of this network: just making a forward pass and counting up all of these numbers (the memory numbers here are written in terms of the total number of values, like we calculated earlier), at four bytes per number this is going to be about 100 megabytes per image. So this is the scale of the memory usage that's happening, and this is only for a forward pass; when you do a backward pass you're going to have to store more, so this is pretty heavy memory-wise. At 100 megabytes per image, if you only have five gigs of total memory, then you're only going to be able to store about 50 of these.

The total number of parameters here is 138 million in this network, which compares with 60 million for AlexNet.

Question?

[student speaks off mic]

- So the question is, what do we mean by deeper, is it the number of filters, the number of layers? Deeper in this case is always referring to layers. There are two usages of the word depth, which is confusing: one is the depth of the channel dimension, as in width by height by depth. But in general when we talk about the depth of a network, this is the total number of layers in the network, and usually in particular we're counting the total number of weight layers, so the total number of layers with trainable weights: convolutional layers and fully connected layers.
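The memory arithmetic above can be sketched with the same kind of counting. The 224 by 224 input and the 64-filter first conv layer are the standard VGG16 settings (they appear later as "conv3-64"); the four bytes per number assumes float32 activations.

```python
bytes_per_number = 4   # float32

# First VGG conv layer output: 224 x 224 spatial, 64 feature maps
conv1_activations = 224 * 224 * 64
conv1_mb = conv1_activations * bytes_per_number / 1024**2
print(round(conv1_mb, 2))       # 12.25 MB of activations for one layer alone

# Scale of the whole forward pass: at ~100 MB of activations per image,
# a 5 GB card holds on the order of 50 images' worth at once.
print(5 * 1024 // 100)          # 51
```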
[student mumbles off mic]

- Okay, so the question is, within each layer, what do the different filters mean? We talked about this back in the convnet lecture, so you can also go back and refer to that, but each filter is a set of weights, say three by three, looking at a three by three by input-depth region, and this produces one feature map, one activation map of all the responses at the different spatial locations. And then we can have as many filters as we want, for example 96, and each of these is going to produce a feature map. Each filter corresponds to a different pattern that we're looking for in the input: we convolve it around, we see the responses everywhere in the input, and we create a map of these, and then another filter will convolve over the image and create another map.

Question?

[student speaks off mic]

- So the question is, is there intuition behind why, as you go deeper into the network, we have more channel depth, so a larger number of filters? And you can have any design that you want, so you don't have to do this.
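The "each filter produces one feature map" idea can be made concrete with a toy, pure-Python sketch. For simplicity this operates on a single input channel (real conv filters also span the full input depth, as described above), and the image and kernels are made-up example values.

```python
# Slide a 3x3 kernel over a 2D input (valid positions only); each kernel
# yields one feature map, so two kernels yield two feature maps.

def conv2d_single_channel(image, kernel):
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(3) for dj in range(3))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]

filters = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],     # picks out the center pixel
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],  # responds to vertical edges
]
feature_maps = [conv2d_single_channel(image, k) for k in filters]
print(len(feature_maps))   # 2 filters -> 2 feature maps
print(feature_maps[0])     # [[5, 6], [8, 9]]
```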
In practice you will see this happen a lot of the time, and one of the reasons is that people try to maintain a relatively constant level of compute. As you go higher up, or deeper, into your network, you're usually also downsampling, so you have a smaller total spatial area, and then increasing the depth a little bit is not as expensive, because the maps are spatially smaller. So that's just one reason.

Question?

[student speaks off mic]

- So the question is, performance-wise, is there any reason to use an SVM loss instead of a softmax loss? No, for a classifier you can use either one, and you did that earlier in the class as well, but in general softmax losses have worked well and are the standard choice for classification here.

Okay, yeah, one more question.

[student mumbles off mic]

- Yes, so the question is, do we have to store all of the memory, or can we throw away the parts that we don't need? And yes, this is true; some of this you don't need to keep. But you're also going to be doing a backward pass through here, where, for the most part, when you're doing the chain rule and so on, you need a lot of these activations, and so in large part a lot of this does need to be kept.
So if we look at the distribution of where the memory is used and where the parameters are, you can see that a lot of the memory is in these early layers, where you still have large spatial dimensions, so you're going to have more memory usage there. And then a lot of the parameters are actually in the last layers: the fully connected layers have a huge number of parameters, because we have all of these dense connections. And so that's something to know and keep in mind; later on we'll see some networks that actually get rid of these fully connected layers and save a lot on the number of parameters.

And then just one last thing to point out: you'll also see different ways of naming all of these layers. Here I've written out exactly what the layers are: conv3-64 means three by three convs with 64 total filters. But for VGGNet, on this diagram on the right here, there are also common ways that people refer to each group of filters, each orange block here, as in conv1 part one, so conv1-1, conv1-2, and so on. So just something to keep in mind.

So VGGNet ended up getting second place in the ImageNet 2014 classification challenge, and first in localization. They followed a very similar training procedure to Alex Krizhevsky's for AlexNet. They didn't use local response normalization; as I mentioned earlier, they found that it didn't really help, and so they took it out.
You'll see VGG16 and VGG19 as the common variants, and this is just the number of layers: 19 is slightly deeper than 16. In practice VGG19 works a very little bit better, at a little more memory usage, so you can use either, but 16 is very commonly used. For best results, like AlexNet, they did ensembling to average several models, and you get better results. And they also showed in their work that the FC7 features, the last fully connected layer before going to the 1000 ImageNet classes, the 4096-size layer just before that, is a good feature representation that can be used as is: you can extract these features from other data and they generalize to other tasks as well. So FC7 is a good feature representation.

Yeah, question?

[student speaks off mic]

- Sorry, what was the question? Okay, so the question is, what is localization here? This is a task, and we'll talk about it a little more in a later lecture on detection and localization, so I don't want to go into detail here, but it's basically: given an image, not just classifying what the class of the image is, but also drawing a bounding box around where that object is in the image. The difference from detection, which is a very related task, is that in detection there can be multiple instances of the object in the image; in localization we're assuming there's just one, so it's classification, but we just have this additional bounding box.
246 00:28:25,343 --> 00:28:32,382 So we looked at VGG which was one of the deep networks from 2014 and then now we'll talk about GoogLeNet 247 00:28:32,382 --> 00:28:36,603 which was the other one that won the classification challenge. 248 00:28:37,612 --> 00:28:47,776 So GoogLeNet again was a much deeper network with 22 layers, but one of the main insights and special things about GoogLeNet is that it really 249 00:28:47,776 --> 00:28:57,866 looked at this problem of computational efficiency and it tried to design a network architecture that was very efficient in the amount of compute. 250 00:28:57,866 --> 00:29:05,023 And so they did this using this inception module, which we'll go into in more detail, and basically stacking 251 00:29:05,023 --> 00:29:08,336 a lot of these inception modules on top of each other. 252 00:29:08,336 --> 00:29:19,841 There's also no fully connected layers in this network, so they got rid of those and were able to save a lot of parameters, and so in total there's only five million parameters, which is twelve times less than AlexNet, 253 00:29:19,841 --> 00:29:24,308 which had 60 million, even though it's much deeper. 254 00:29:24,308 --> 00:29:26,975 It got 6.7% top five error. 255 00:29:31,392 --> 00:29:35,363 So what's the inception module? So the idea behind the inception module 256 00:29:35,363 --> 00:29:40,023 is that they wanted to design a good local network topology 257 00:29:40,023 --> 00:29:52,341 and it has this idea of a local topology that you can think of as a network within a network, and then you stack a lot of these local topologies one on top of each other. 258 00:29:52,341 --> 00:29:58,387 And so in this local network that they're calling an inception module, what they're doing is they're basically 259 00:29:58,387 --> 00:30:07,138 applying several different kinds of filter operations in parallel on top of the same input coming into this same layer. 
260 00:30:07,138 --> 00:30:11,896 So we have our input coming in from the previous layer and then we're going to do different kinds of convolutions. 261 00:30:11,896 --> 00:30:25,647 So a one by one conv, a three by three conv, a five by five conv, and then they also have a pooling operation, in this case three by three pooling, and so you get all of these different outputs from these different layers, 262 00:30:25,647 --> 00:30:31,499 and then what they do is they concatenate all these filter outputs together depth-wise, and so 263 00:30:31,499 --> 00:30:38,893 then this creates one tensor output at the end that is going to pass on to the next layer. 264 00:30:41,020 --> 00:30:50,015 So if we look at just a naive way of doing this, we do exactly that: we have all of these different operations, we get the outputs, we concatenate them together. 265 00:30:50,015 --> 00:30:52,386 So what's the problem with this? 266 00:30:52,386 --> 00:30:57,717 And it turns out that computational complexity is going to be a problem here. 267 00:30:58,982 --> 00:31:11,156 So if we look more carefully at an example, so here just as an example I've put a one by one conv with 128 filters, a three by three conv with 192 filters, and a five by five conv with 96 filters. 268 00:31:11,156 --> 00:31:19,398 Assume everything has basically the stride that's going to maintain the spatial dimensions, and that we have this input coming in. 269 00:31:21,341 --> 00:31:29,231 So what is the output size of the one by one conv with 128 filters? Who has a guess? 270 00:31:35,910 --> 00:31:39,910 OK so I heard 28 by 28, by 128 which is correct. 
271 00:31:40,988 --> 00:31:53,159 So right, with a one by one conv we're going to maintain spatial dimensions, and then on top of that, each conv filter is going to look through the entire 256 depth of the input, 272 00:31:53,159 --> 00:32:00,194 but then the output is going to be, we have a 28 by 28 feature map for each of the 128 filters that we have in this conv layer. 273 00:32:00,194 --> 00:32:02,361 So we get 28 by 28 by 128. 274 00:32:05,469 --> 00:32:14,939 OK and then now if we do the same thing and we look at the output sizes, sorry, of all of the different filters here, after the 275 00:32:14,939 --> 00:32:20,379 three by three conv we're going to have this volume of 28 by 28 by 192, right, and after the five by five conv 276 00:32:20,379 --> 00:32:24,559 we have 96 filters here, so 28 by 28 by 96, 277 00:32:24,559 --> 00:32:34,712 and then our pooling layer is just going to keep the same spatial dimension here, and the pooling layer will preserve the depth, 278 00:32:34,712 --> 00:32:40,192 and here because of our stride, we're also going to preserve our spatial dimensions. 279 00:32:41,225 --> 00:32:51,498 And so now if we look at the output size after filter concatenation, what we're going to get is 28 by 28, these are all 28 by 28, and we're concatenating depth-wise. 280 00:32:51,498 --> 00:32:59,330 So we get 28 by 28 times all of these added together, and the total output size is going to be 28 by 28 by 672. 281 00:33:01,113 --> 00:33:10,208 So the input to our inception module was 28 by 28 by 256, and the output from this module is 28 by 28 by 672. 282 00:33:11,466 --> 00:33:17,254 So we kept the same spatial dimensions, and we blew up the depth. 283 00:33:17,254 --> 00:33:18,188 Question. 284 00:33:18,188 --> 00:33:21,905 [student speaks off mic] 285 00:33:21,905 --> 00:33:25,546 OK so in this case, yeah, the question is, how are we getting 28 by 28 for everything? 
286 00:33:25,546 --> 00:33:29,307 So here we're doing all the zero padding in order to maintain the spatial dimensions, 287 00:33:29,307 --> 00:33:33,403 and that way we can do this filter concatenation depth-wise. 288 00:33:34,395 --> 00:33:36,233 Question in the back. 289 00:33:36,233 --> 00:33:39,650 [student speaks off mic] 290 00:33:44,824 --> 00:33:47,805 - OK the question is, what's the 256 depth at the input, 291 00:33:47,805 --> 00:33:53,814 and so this is not the input to the network, this is the input just to this local module that I'm looking at. 292 00:33:53,814 --> 00:34:00,506 So in this case 256 is the depth out of the previous inception module that came just before this. 293 00:34:00,506 --> 00:34:08,438 And so now coming out we have 28 by 28 by 672, and that's going to be the input to the next inception module. 294 00:34:08,438 --> 00:34:09,915 Question. 295 00:34:09,916 --> 00:34:13,333 [student speaks off mic] 296 00:34:17,039 --> 00:34:23,181 - Okay the question is, how did we get 28 by 28 by 128 for the first one, the first conv, 297 00:34:23,181 --> 00:34:34,058 and this is basically, it's a one by one convolution right, so we're going to take this one by one convolution and slide it across our 28 by 28 by 256 input spatially 298 00:34:35,485 --> 00:34:41,956 where at each location, it's going to do a dot product through the entire 256 depth, and so we do this 299 00:34:41,956 --> 00:34:46,983 one by one conv, slide it over spatially, and we get a feature map out that's 28 by 28 by one. 300 00:34:46,983 --> 00:34:58,311 There's one number at each spatial location coming out, and each filter produces one of these 28 by 28 by one maps, and we have here a total of 128 filters, 301 00:35:01,050 --> 00:35:04,800 and that's going to produce 28 by 28, by 128. 
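[A minimal NumPy sketch of the shape bookkeeping just described: the one by one conv as a matrix multiply at every spatial location, and the naive module's depth-wise concatenation. Data is random and the shapes follow the lecture's example.]

```python
import numpy as np

h, w, in_depth = 28, 28, 256

# A 1x1 conv is a matrix multiply applied at every spatial location:
# each of the 128 filters dots through the full 256-depth column.
x = np.random.randn(h, w, in_depth)               # input volume
filters = np.random.randn(in_depth, 128)          # one 1x1x256 filter per column
out_1x1 = (x.reshape(h * w, in_depth) @ filters).reshape(h, w, 128)
print(out_1x1.shape)  # (28, 28, 128)

# Padding/stride keep every branch at 28x28, so the naive module
# concatenates depths: 128 + 192 + 96 + 256 (pooling keeps input depth).
out_depth = 128 + 192 + 96 + in_depth
print(h, w, out_depth)  # 28 28 672
```

The same reshape trick is why a one by one conv is often described as a per-pixel fully connected layer over the depth dimension.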
302 00:35:05,809 --> 00:35:10,403 OK so if you look at the number of operations that are happening in the convolutional layer, 303 00:35:10,403 --> 00:35:22,553 let's look at the first one for example, this one by one conv; as I was just saying, at each location we're doing a one by one by 256 dot product. 304 00:35:24,545 --> 00:35:28,358 So there's 256 multiply operations happening here 305 00:35:28,358 --> 00:35:37,865 and then for each filter map we have 28 by 28 spatial locations, so that's the 28 times 28, the first two numbers that are multiplied here. 306 00:35:37,865 --> 00:35:53,859 These are the spatial locations for each filter map, and so we have to do these 256 multiplications at each one of these, and then we have 128 total filters at this layer, so we're producing 128 total feature maps. 307 00:35:53,859 --> 00:36:01,221 And so the total number of these operations here is going to be 28 times 28 times 128 times 256. 308 00:36:02,129 --> 00:36:10,349 And so this is going to be the same for, you can think about this for the three by three conv, and the five by five conv, it's exactly the same principle. 309 00:36:10,349 --> 00:36:16,690 And in total we're going to get 854 million operations that are happening here. 310 00:36:17,968 --> 00:36:21,191 - [Student] And the 128, 192, and 96 are just values 311 00:36:22,131 --> 00:36:29,044 - The question is whether the 128, 192 and 96 are values that I picked. Yes, but these are not values that I just came up with arbitrarily. 312 00:36:29,044 --> 00:36:35,594 They are similar to the ones that you will see in a particular layer of the inception net, 313 00:36:35,594 --> 00:36:43,103 so in GoogLeNet basically, each module has a different set of these kinds of parameters, and I picked one that was similar to one of these. 314 00:36:45,089 --> 00:36:49,046 And so this is very expensive computationally right, these operations. 
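[The 854 million figure can be reproduced directly: each conv does (output locations) x (filters) x (size of the per-location dot product) multiplies. A few lines of Python with the lecture's filter counts:]

```python
h, w, in_depth = 28, 28, 256

def conv_multiplies(k, num_filters):
    # output locations * filters * (k x k x depth dot product per location)
    return h * w * num_filters * k * k * in_depth

total = (
    conv_multiplies(1, 128)    # 1x1 conv, 128 filters: 25,690,112
    + conv_multiplies(3, 192)  # 3x3 conv, 192 filters: 346,816,512
    + conv_multiplies(5, 96)   # 5x5 conv, 96 filters: 481,689,600
)
print(total)  # 854,196,224 -> the ~854 million operations
```

Note how the three by three and five by five convs dominate the total; that is exactly where the bottleneck trick described next earns its keep.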
315 00:36:49,046 --> 00:36:55,507 And then the other thing that I also want to note is that the pooling layer also adds to this problem, because it preserves the whole feature depth. 316 00:36:57,062 --> 00:37:03,519 So at every layer your total depth can only grow, right; you're going to take the full feature depth 317 00:37:03,519 --> 00:37:10,513 from your pooling layer, as well as all the additional feature maps from the conv layers, and add these up together. 318 00:37:10,513 --> 00:37:18,960 So here our input was 256 depth and our output is 672 depth and you're just going to keep increasing this as you go up. 319 00:37:21,920 --> 00:37:25,441 So how do we deal with this and how do we keep this more manageable? 320 00:37:25,441 --> 00:37:36,181 And so one of the key insights that GoogLeNet used was that well, we can address this by using bottleneck layers, and try and project these feature maps 321 00:37:36,181 --> 00:37:43,174 to a lower dimension before our convolutional operations, so before our expensive layers. 322 00:37:45,007 --> 00:37:46,642 And so what exactly does that mean? 323 00:37:46,642 --> 00:37:58,080 So as a reminder, a one by one convolution, I guess we were just going through this, but it's taking your input volume, it's performing a dot product at each spatial location, and what it does is it preserves the spatial dimensions 324 00:38:00,141 --> 00:38:06,139 but it reduces the depth, and it reduces that by projecting your input depth to a lower dimension. 325 00:38:06,139 --> 00:38:10,515 It's basically like a linear combination of your input feature maps. 326 00:38:12,880 --> 00:38:18,199 And so the main idea is that it's projecting your depth down, and so the inception module 327 00:38:18,199 --> 00:38:29,085 takes these one by one convs and inserts them at a bunch of places in these modules in order to alleviate this expensive compute. 
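[A sketch of how such bottlenecks change the multiply count. The 64-filter bottleneck choices below are hypothetical, so the totals won't match the lecture slide's exact configuration; the point is the severalfold reduction from dotting the expensive convs over depth 64 instead of 256.]

```python
h, w = 28, 28

def conv_multiplies(k, num_filters, in_depth):
    # output locations * filters * (k x k x in_depth dot product per location)
    return h * w * num_filters * k * k * in_depth

naive = (conv_multiplies(1, 128, 256)
         + conv_multiplies(3, 192, 256)
         + conv_multiplies(5, 96, 256))

bottlenecked = (conv_multiplies(1, 128, 256)   # 1x1, 128 branch (unchanged)
                + conv_multiplies(1, 64, 256)  # 1x1, 64 bottleneck before the 3x3
                + conv_multiplies(3, 192, 64)  # 3x3 now dots over depth 64
                + conv_multiplies(1, 64, 256)  # 1x1, 64 bottleneck before the 5x5
                + conv_multiplies(5, 96, 64)   # 5x5 now dots over depth 64
                + conv_multiplies(1, 64, 256)) # 1x1, 64 after the pooling branch

print(naive, bottlenecked)  # 854,196,224 vs 271,351,808 with these illustrative counts
```

The cheap one by one projections cost a little extra, but they shrink the dominant three by three and five by five terms by a factor of four.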
328 00:38:29,085 --> 00:38:36,162 So before the three by three and five by five conv layers, it puts in one of these one by one convolutions. 329 00:38:36,162 --> 00:38:42,315 And then after the pooling layer it also puts in an additional one by one convolution. 330 00:38:43,284 --> 00:38:47,609 Right, so these are the one by one bottleneck layers that are added in. 331 00:38:48,562 --> 00:38:52,736 And so how does this change the math that we were looking at earlier? 332 00:38:52,736 --> 00:38:58,589 So now basically what's happening is that we still have the same input here, 28 by 28 by 256, 333 00:38:58,589 --> 00:39:12,856 but these one by one convs are going to reduce the depth dimension, and so you can see before the three by three convs, if I put a one by one conv with 64 filters, my output from that is going to be 28 by 28 by 64. 334 00:39:14,184 --> 00:39:25,154 So now going into the three by three convs afterwards, instead of 28 by 28 by 256 coming in, we only have a 28 by 28 by 64 block coming in. 335 00:39:25,154 --> 00:39:31,454 And so this now produces a smaller input going into these conv layers; the same thing for 336 00:39:31,454 --> 00:39:40,499 the five by five conv, and then for the pooling layer, after the pooling comes out, we're going to reduce the depth after this. 337 00:39:41,562 --> 00:39:51,214 And so, if you work out the math the same way for all of the convolutional ops here, adding in now all these one by one convs on top of the three by threes and five by fives, 338 00:39:51,214 --> 00:40:02,499 the total number of operations is 358 million operations, so it's much less than the 854 million that we had in the naive version, and so you can see how you 339 00:40:02,499 --> 00:40:10,438 can use this one by one conv, and the filter size for that, to control your computation. 340 00:40:10,438 --> 00:40:12,118 Yes, question in the back. 
341 00:40:12,118 --> 00:40:15,535 [student speaks off mic] 342 00:40:23,525 --> 00:40:30,979 - Yes, so the question is, have you looked into what information might be lost by doing this one by one conv at the beginning. 343 00:40:30,979 --> 00:40:35,112 And so there might be some information loss, 344 00:40:35,112 --> 00:40:46,013 but at the same time if you're doing these projections you're taking a linear combination of these input feature maps, which have redundancy in them; you're taking combinations of them, 345 00:40:47,623 --> 00:40:59,422 and you're also introducing an additional non-linearity after the one by one conv, so it also actually helps in that way with adding a little bit more depth, and so, I don't think there's a rigorous analysis 346 00:40:59,422 --> 00:41:07,314 of this, but basically in general this works better, and there are reasons why it helps as well. 347 00:41:07,314 --> 00:41:15,627 OK so here we have, we're basically using these one by one convs to help manage our computational complexity, 348 00:41:15,627 --> 00:41:20,450 and then what GoogLeNet does is it takes these inception modules and it's going to stack all these together. 349 00:41:20,450 --> 00:41:22,827 So this is the full inception architecture. 350 00:41:22,827 --> 00:41:32,773 And if we look at this in a little bit more detail, so here I've flipped it, because it's so big, it's not going to fit vertically any more on the slide. 351 00:41:32,773 --> 00:41:41,867 So what we start with is we first have this stem network, so this is more the kind of vanilla plain conv net that we've seen earlier, a sequence of layers. 352 00:41:43,256 --> 00:41:48,570 So conv pool, a couple of convs and another pool just to get started, and then after that 353 00:41:48,570 --> 00:41:54,911 we have all of our multiple inception modules all stacked on top of each other, 354 00:41:54,911 --> 00:41:58,433 and then on top we have our classifier output. 
355 00:41:58,433 --> 00:42:08,982 And notice here that they've really removed the expensive fully connected layers; it turns out that the model works great without them, and you reduce a lot of parameters. 356 00:42:08,982 --> 00:42:17,098 And then what they also have here is, you can see these couple of extra stems coming out, and these are auxiliary classification outputs 357 00:42:18,866 --> 00:42:23,273 and so these are also, you know, just little mini networks 358 00:42:23,273 --> 00:42:29,217 with an average pooling, a one by one conv, a couple of fully connected layers here going to 359 00:42:29,217 --> 00:42:35,702 a softmax, also a 1000-way softmax over the ImageNet classes. 360 00:42:35,702 --> 00:42:41,350 And so you're actually using your ImageNet training classification loss in three separate places here: 361 00:42:41,350 --> 00:42:51,752 the standard end of the network, as well as in these two places earlier on in the network, and the reason they do that is just that this is a deep network 362 00:42:51,752 --> 00:43:02,140 and they found that with these additional auxiliary classification outputs, you get more gradient signal injected at the earlier layers, 363 00:43:02,140 --> 00:43:13,484 and so more helpful signal flowing in, because these intermediate layers should also be helpful. You should be able to do classification based off some of these as well. 
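[A sketch of how such auxiliary losses are typically combined into one training objective. The 0.3 discount is the weight used in the GoogLeNet paper; the function name and the example loss values here are hypothetical.]

```python
def total_loss(main_loss, aux_loss_1, aux_loss_2, aux_weight=0.3):
    # Auxiliary heads contribute discounted gradients that reach the
    # earlier layers directly; at test time only the main head is used.
    return main_loss + aux_weight * (aux_loss_1 + aux_loss_2)

print(total_loss(2.0, 3.0, 2.5))  # 2.0 + 0.3 * 5.5, about 3.65
```

Backprop through this single scalar sends the main gradient all the way down while the two auxiliary terms inject fresh signal partway through the stack.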
364 00:43:13,484 --> 00:43:20,711 And so this is the full architecture; there's 22 total layers with weights, and so 365 00:43:20,711 --> 00:43:29,474 within each of these modules each of those one by one, three by three, five by five is a weight layer, including all of these parallel layers, 366 00:43:29,474 --> 00:43:44,128 and in general it's a relatively more carefully designed architecture, and part of this is based on some of these intuitions that we're talking about, and part of it 367 00:43:44,128 --> 00:43:55,511 also is just, you know, the authors at Google had huge clusters and they were cross-validating across all kinds of design choices, and this is what ended up working well. 368 00:43:55,511 --> 00:43:57,105 Question? 369 00:43:57,105 --> 00:44:00,522 [student speaks off mic] 370 00:44:24,442 --> 00:44:32,457 - Yeah so the question is, are the auxiliary outputs actually useful for the final classification, to use these as well? 371 00:44:32,457 --> 00:44:39,164 I think when they're training them they do average all of these for the losses coming out. I think they are helpful. 372 00:44:39,164 --> 00:44:49,272 I can't remember if in the final architecture, whether they average all of these or just take one; it seems very possible that they would use all of them, but you'll need to check on that. 373 00:44:49,272 --> 00:44:52,689 [student speaks off mic] 374 00:44:58,352 --> 00:45:10,219 - So the question is, for the bottleneck layers, is it possible to use some other types of dimensionality reduction, and yes, you can use other kinds of dimensionality reduction. 375 00:45:10,219 --> 00:45:17,138 The benefit here of this one by one conv is, you're getting this effect, but it's all, you know, it's a conv layer just like any other. 376 00:45:17,138 --> 00:45:26,180 You have the whole network of these, you just train this full network, backprop through everything, and it's learning how to combine the previous feature maps. 
377 00:45:28,601 --> 00:45:30,730 Okay yeah, question in the back. 378 00:45:30,730 --> 00:45:34,147 [student speaks off mic] 379 00:45:35,807 --> 00:45:42,549 - Yes so, the question is, are any weights shared or are they all separate, and yeah, 380 00:45:42,549 --> 00:45:45,542 all of these layers have separate weights. 381 00:45:45,542 --> 00:45:46,690 Question. 382 00:45:46,690 --> 00:45:50,107 [student speaks off mic] 383 00:45:56,784 --> 00:46:00,143 - Yes so the question is, why do we have to inject gradients at earlier layers? 384 00:46:00,143 --> 00:46:07,785 So our classification output at the very end, where we get a gradient on this, it's passed all the way back through the chain rule 385 00:46:09,599 --> 00:46:21,178 but the problem is when you have very deep networks and you're going all the way back through these, some of this gradient signal can become minimized and lost closer to the beginning, and so that's why having 386 00:46:21,178 --> 00:46:28,377 these additional ones in earlier parts can help provide some additional signal. 387 00:46:28,377 --> 00:46:32,667 [student mumbles off mic] 388 00:46:32,667 --> 00:46:35,853 - So the question is, are you doing backprop all the times for each output? 389 00:46:35,853 --> 00:46:41,446 No, it's just one backprop all the way through, and you can think of these three, 390 00:46:41,446 --> 00:46:48,075 you can think of there being kind of like an addition at the end of these if you were to draw out your computational graph, and so you get your 391 00:46:48,075 --> 00:46:54,004 final signal and you can just take all of these gradients and just backprop them all the way through. 392 00:46:54,004 --> 00:46:58,970 So it's as if they were added together at the end in a computational graph. 393 00:46:58,970 --> 00:47:05,423 OK so in the interest of time, because we still have a lot to get through, we can take other questions offline. 394 00:47:07,353 --> 00:47:10,520 Okay so GoogLeNet, basically 22 layers. 
395 00:47:11,441 --> 00:47:15,983 It has an efficient inception module, there's no fully connected layers. 396 00:47:15,983 --> 00:47:22,026 12 times fewer parameters than AlexNet, and it's the ILSVRC 2014 classification winner. 397 00:47:25,228 --> 00:47:30,869 And so now let's look at the 2015 winner, which is the ResNet network, and so here 398 00:47:30,869 --> 00:47:38,339 this idea is really this revolution of depth, right. We were starting to increase depth in 2014, and here we've 399 00:47:38,339 --> 00:47:45,616 just had this hugely deeper model: the 152-layer ResNet architecture. 400 00:47:45,616 --> 00:47:48,846 And so now let's look at that in a little bit more detail. 401 00:47:48,846 --> 00:47:54,286 So the ResNet architecture is getting extremely deep networks, much deeper than any other networks 402 00:47:54,286 --> 00:48:00,479 before, and it's doing this using this idea of residual connections which we'll talk about. 403 00:48:00,479 --> 00:48:04,158 And so, they had a 152-layer model for ImageNet. 404 00:48:04,158 --> 00:48:07,969 They were able to get 3.57% top five error with this 405 00:48:07,969 --> 00:48:18,114 and the really special thing is that they swept all classification and detection contests in the ImageNet benchmark and this other benchmark called COCO. 406 00:48:18,114 --> 00:48:23,546 It just basically won everything. So it was just clearly better than everything else. 407 00:48:25,055 --> 00:48:32,538 And so now let's go into a little bit of the motivation behind ResNet and the residual connections that we'll talk about. 408 00:48:32,538 --> 00:48:41,939 And the question that they started off by trying to answer is, what happens when we try and stack deeper and deeper layers on a plain convolutional neural network? 
409 00:48:41,939 --> 00:48:53,874 So if we take something like VGG or some normal network that's just stacks of conv and pool layers on top of each other, can we just continuously extend these, get deeper layers, and just do better? 410 00:48:55,601 --> 00:48:58,421 And the answer is no. 411 00:48:58,421 --> 00:49:06,599 So if you look at what happens when you get deeper, so here I'm comparing a 20 layer network and a 56 layer network, and so this is just a plain 412 00:49:09,498 --> 00:49:16,817 kind of network; you'll see that in the test error here on the right, the 56 layer network is doing worse than the 20 layer network. 413 00:49:16,817 --> 00:49:19,771 So the deeper network was not able to do better. 414 00:49:19,771 --> 00:49:29,680 But then the really weird thing is, now if you look at the training error, we here have again the 20 layer network and the 56 layer network. 415 00:49:29,680 --> 00:49:40,271 The 56 layer network, one of the obvious problems you think: I have a really deep network, I have tons of parameters, maybe it's probably starting to overfit at some point. 416 00:49:41,294 --> 00:49:48,985 But what actually happens is that when you're overfitting you would expect to have very good, very low training error, and just bad test error, 417 00:49:48,985 --> 00:49:55,511 but what's happening here is that in the training error, the 56 layer network is also doing worse than the 20 layer network. 418 00:49:56,833 --> 00:50:01,545 And so even though the deeper model performs worse, this is not caused by over-fitting. 419 00:50:03,462 --> 00:50:10,253 And so the hypothesis of the ResNet creators is that the problem is actually an optimization problem. 420 00:50:10,253 --> 00:50:15,611 Deeper models are just harder to optimize than more shallow networks. 421 00:50:16,835 --> 00:50:23,263 And the reasoning was that, well, a deeper model should be able to perform at least as well as a shallower model. 
422 00:50:23,263 --> 00:50:32,330 You can actually have a solution by construction where you just take the learned layers from your shallower model, you just copy these over, and then for the remaining additional 423 00:50:32,330 --> 00:50:35,192 deeper layers you just add identity mappings. 424 00:50:35,192 --> 00:50:39,533 So by construction this should be working just as well as the shallower model. 425 00:50:39,533 --> 00:50:46,295 And so your deeper models, which weren't able to learn properly, should be able to learn at least this. 426 00:50:46,295 --> 00:51:00,594 And so motivated by this, their solution was, well, how can we make it easier for our architecture, our model, to learn these kinds of solutions, or at least something like this? 427 00:51:00,594 --> 00:51:11,794 And so their idea is, well, instead of just stacking all these layers on top of each other and having every layer try and learn some underlying mapping 428 00:51:11,794 --> 00:51:21,708 of a desired function, let's instead have these blocks, where we try and fit a residual mapping instead of a direct mapping. 429 00:51:21,708 --> 00:51:28,220 And so what this looks like is here on the right, where the input to these blocks is just the input coming in 430 00:51:29,818 --> 00:51:48,499 and here we are going to use our layers, here on the side, to try and fit some residual of our desired H of X, that is H of X minus X, instead of the desired function H of X directly. 431 00:51:49,450 --> 00:51:55,827 And so basically at the end of this block we have the skip connection on the right here, this loop, 432 00:51:55,827 --> 00:52:07,241 where we just take our input and pass it through as an identity, and so if we had no weight layers in between, the output would just be the identity, it would be the same thing as the input, but now we use 433 00:52:07,241 --> 00:52:12,562 our additional weight layers to learn some delta, some residual from our X. 
434 00:52:14,067 --> 00:52:24,502 And so now the output of this is going to be just our original X plus some residual, which is basically a delta, and so the idea is that 435 00:52:24,502 --> 00:52:31,428 now it should be easy, for example in the case where identity is ideal, 436 00:52:32,510 --> 00:52:39,249 to just squash all of these weights of F of X from our weight layers, just set them all to zero 437 00:52:39,249 --> 00:52:48,578 for example; then we're just going to get identity as the output, and we can get something, for example, close to this solution by construction that we had earlier. 438 00:52:48,578 --> 00:53:00,962 Right, so this is just a network architecture that says okay, let's try and fit this residual with our weight layers, and learn something close, that way the output will more likely be something close to X, 439 00:53:00,962 --> 00:53:05,388 it's just modifying X, rather than learning exactly this full mapping of what it should be. 440 00:53:05,388 --> 00:53:08,249 Okay, any questions about this? 441 00:53:08,249 --> 00:53:09,189 [student speaks off mic] 442 00:53:09,189 --> 00:53:12,689 - The question is, is it the same dimension? 443 00:53:13,770 --> 00:53:17,603 So yes, these two paths are the same dimension. 444 00:53:18,752 --> 00:53:32,288 In general either it's the same dimension, or what they actually do is they have these projection shortcuts, and they have different ways of padding to make things work out to be the same dimension, depth-wise. 445 00:53:32,288 --> 00:53:33,395 Yes 446 00:53:33,395 --> 00:53:39,120 - [Student] When you use the word residual you were talking about [mumbles off mic] 447 00:53:45,857 --> 00:53:53,638 - So the question is, what exactly do we mean by residual; is the output of this transformation a residual? 
448 00:53:53,638 --> 00:54:01,899 So we can think of our output here, right, as this F of X plus X, where F of X is the output of our transformation 449 00:54:01,899 --> 00:54:06,650 and then X is our input, just passed through by the identity. 450 00:54:06,650 --> 00:54:17,198 So with a plain layer, what we're trying to do is learn something like H of X, but what we saw earlier is that it's hard to learn 451 00:54:17,198 --> 00:54:20,671 a good H of X as we get very deep networks. 452 00:54:20,671 --> 00:54:29,438 And so here the idea is, let's try and break it down instead as H of X equal to F of X plus X, and let's just try and learn F of X. 453 00:54:29,438 --> 00:54:39,741 And so instead of learning directly this H of X, we just want to learn what is it that we need to add or subtract to our input as we move on to the next layer. 454 00:54:39,741 --> 00:54:45,889 So you can think of it as kind of modifying this input, in place in a sense. We have-- 455 00:54:45,889 --> 00:54:49,121 [interrupted by student mumbling off mic] 456 00:54:49,121 --> 00:54:58,129 - The question is, when we're saying the word residual, are we talking about F of X? Yeah. So F of X is what we're calling the residual. And it just has that meaning. 457 00:55:01,477 --> 00:55:03,941 Yes, another question. 458 00:55:03,941 --> 00:55:07,441 [student mumbles off mic] 459 00:55:11,319 --> 00:55:20,145 - So the question is, in practice do we just sum F of X and X together, or do we learn some weighted combination, and you just do a direct sum. 460 00:55:20,145 --> 00:55:28,809 Because when you do a direct sum, this is the idea of: let me just learn what it is I have to add or subtract onto X. 461 00:55:30,652 --> 00:55:34,463 Is this clear to everybody, the main intuition? 462 00:55:34,463 --> 00:55:35,361 Question. 
463 00:55:35,361 --> 00:55:38,778 [student speaks off mic] 464 00:55:40,721 --> 00:55:47,099 - Yeah, so the question is, it's not clear why learning the residual should be easier than learning the direct mapping? 465 00:55:47,099 --> 00:55:58,747 And so this is just their hypothesis, and the hypothesis is that if we're learning the residual you just have to learn what's the delta to X, right? 466 00:55:58,747 --> 00:56:16,101 And our hypothesis is that generally, even something like our solution by construction, where we had some number of these shallow layers that were learned and we had all these identity mappings at the top, this was a solution that should have been 467 00:56:16,101 --> 00:56:23,985 good, and so that implies that maybe a lot of these layers, actually something just close to identity, would be a good layer. 468 00:56:23,985 --> 00:56:30,954 And so because of that, now we formulate this as being able to learn the identity plus just a little delta. 469 00:56:30,954 --> 00:56:34,315 And if really the identity is best, we just 470 00:56:34,315 --> 00:56:40,363 squash the F of X transformation to just be zero, which is something that might seem relatively easier to learn, 471 00:56:40,363 --> 00:56:44,784 and so we're able to get things that are close to identity mappings. 472 00:56:44,784 --> 00:56:50,966 And so again this is not something that's necessarily proven or anything; it's just the intuition and hypothesis, 473 00:56:50,966 --> 00:56:58,708 and then we'll also see later some works where people are actually trying to challenge this and say, oh maybe it's not actually the residuals that are so necessary, 474 00:56:58,708 --> 00:57:07,507 but at least this is the hypothesis for this paper, and in practice using this model, it was able to do very well. 475 00:57:07,507 --> 00:57:08,810 Question. 
476 00:57:08,810 --> 00:57:12,227 [student speaks off mic] 477 00:57:41,813 --> 00:57:49,128 - Yes, so the question is, have people tried other ways of combining the inputs from previous layers, and yes, 478 00:57:49,128 --> 00:57:56,747 so this is basically a very active area of research, how we formulate all these connections and what's connected to what in all of these structures. 479 00:57:56,747 --> 00:58:04,695 So we'll see a few more examples of different network architectures briefly later, but this is an active area of research. 480 00:58:05,658 --> 00:58:12,093 OK so we basically have all of these residual blocks that are stacked on top of each other. 481 00:58:12,093 --> 00:58:14,788 We can see the full ResNet architecture. 482 00:58:14,788 --> 00:58:27,299 Each of these residual blocks has two three by three conv layers as part of this block, and there's also been work just saying that this happens to be a good configuration that works well. 483 00:58:27,299 --> 00:58:29,828 We stack all these blocks together very deeply. 484 00:58:29,828 --> 00:58:40,851 Another thing with this very deep architecture, it's basically also enabling up to 150 layers deep of this, and then what we do is we stack 485 00:58:46,582 --> 00:58:53,982 all these, and periodically we also double the number of filters and down sample spatially using stride two when we do that. 486 00:58:55,856 --> 00:59:03,867 And then we have this additional conv layer at the very beginning of our network, and at the end here, we also don't have any fully connected layers; 487 00:59:03,867 --> 00:59:08,641 we just have a global average pooling layer that's going to average over everything spatially, 488 00:59:08,641 --> 00:59:12,808 and then be input into the last 1000-way classification layer. 
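[A toy sketch of the residual block just described, in plain NumPy, with fully connected layers standing in for the two three by three convs. It shows the key property from the discussion above: with the weight layers squashed to zero, the block is exactly the identity.]

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    # F(x) = W2 * relu(W1 * x); the block outputs F(x) + x via the skip connection.
    return w2 @ relu(w1 @ x) + x

d = 8
x = np.random.randn(d)

# Zero weights -> F(x) = 0 -> the block passes x through unchanged.
zeros = np.zeros((d, d))
assert np.allclose(residual_block(x, zeros, zeros), x)

# With small nonzero weights, the block output is x plus a small learned delta.
w1, w2 = 0.01 * np.random.randn(d, d), 0.01 * np.random.randn(d, d)
print(residual_block(x, w1, w2) - x)  # this difference is the residual F(x)
```

In a real ResNet the two matrices are convolutions (with batch norm), but the F(x) + x structure is identical.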
489 00:59:14,694 --> 00:59:16,991 So this is the full ResNet architecture 490 00:59:16,991 --> 00:59:21,935 and it's very simple and elegant, just stacking up all of these ResNet blocks on top of each other, 491 00:59:21,935 --> 00:59:29,389 and they have total depths of 34, 50, 101, and they tried up to 152 for ImageNet. 492 00:59:34,230 --> 00:59:43,964 OK so one additional thing just to know is that for the very deep networks, so the ones that are more than 50 layers deep, they also use bottleneck layers 493 00:59:43,964 --> 00:59:46,663 similar to what GoogLeNet did, in order to improve efficiency, 494 00:59:46,663 --> 00:59:57,195 and so within each block now, what they did is have this one by one conv filter that first projects the input down to a smaller depth. 495 00:59:57,195 --> 01:00:07,949 So again if we are looking at, let's say, a 28 by 28 by 256 input, we do this one by one conv, and it's projecting the depth down. We get 28 by 28 by 64. 496 01:00:09,107 --> 01:00:18,486 Now your convolution, your three by three conv, and here they only have one, is operating over this reduced depth, so it's going to be less expensive, 497 01:00:18,486 --> 01:00:29,870 and then afterwards they have another one by one conv that projects the depth back up to 256, and so this is the actual block that you'll see in the deeper networks. 498 01:00:33,021 --> 01:00:41,282 So in practice the ResNet also uses batch normalization after every conv layer, they use Xavier initialization 499 01:00:41,282 --> 01:00:50,578 with an extra scaling factor that they introduced to improve the initialization, trained with SGD plus momentum. 500 01:00:51,604 --> 01:00:59,470 For their learning rate they use a similar type of schedule, where you decay your learning rate when your validation error plateaus. 501 01:01:01,751 --> 01:01:05,874 Mini-batch size 256, a little bit of weight decay, and no dropout. 
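The bottleneck arithmetic described above can be checked with some back-of-the-envelope multiply counts on that 28 by 28 by 256 input. This is a rough cost sketch (stride 1, biases ignored), not a profiler measurement, comparing the 1x1 / 3x3 / 1x1 bottleneck against two plain 3x3 convs at full depth.

```python
# Multiply counts for the bottleneck block described above versus two plain
# 3x3 convs at full depth, on a 28x28x256 input (stride 1, biases ignored).
# Back-of-the-envelope arithmetic, not a profiler measurement.
def conv_mults(h, w, c_in, c_out, k):
    """Multiplies for a kxk conv with c_out filters on an h x w x c_in input."""
    return h * w * c_out * (k * k * c_in)

h = w = 28
bottleneck = (conv_mults(h, w, 256, 64, 1)     # 1x1, project 256 -> 64
              + conv_mults(h, w, 64, 64, 3)    # 3x3 at the reduced depth
              + conv_mults(h, w, 64, 256, 1))  # 1x1, project back to 256
plain = 2 * conv_mults(h, w, 256, 256, 3)      # two 3x3 convs at depth 256
print(bottleneck, plain)  # 54591488 924844032
```

So the bottleneck version does roughly 17x fewer multiplies for the same input and output shape, which is why the deeper ResNets use it.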
502 01:01:07,645 --> 01:01:13,581 And so experimentally they were able to show that they were able to train these very deep networks without degrading. 503 01:01:13,581 --> 01:01:19,060 They were able to have basically good gradient flow coming all the way back down through the network. 504 01:01:19,060 --> 01:01:22,625 They tried up to 152 layers on ImageNet, 505 01:01:22,625 --> 01:01:26,632 1202 on CIFAR, which is a smaller data set that you have played with, 506 01:01:26,632 --> 01:01:35,024 and they also saw that now your deeper networks are able to achieve lower training errors, as expected. 507 01:01:36,303 --> 01:01:44,543 So you don't have the same strange plots that we saw earlier, where the behavior was in the wrong direction. 508 01:01:44,543 --> 01:01:54,843 And so from here they were able to sweep first place at all of the ILSVRC competitions, and all of the COCO competitions in 2015, by significant margins. 509 01:01:56,152 --> 01:02:06,649 Their total top five error was 3.6% for classification, and this is actually better than the human performance reported for ImageNet. 510 01:02:08,902 --> 01:02:22,551 There was also a human metric that came from our lab actually, Andrej Karpathy spent like a week training himself and then basically did this task himself 511 01:02:24,730 --> 01:02:34,191 and was I think somewhere around 5-ish percent, and so it was basically able to do better than that human at least. 512 01:02:36,175 --> 01:02:42,069 Okay, so these are kind of the main networks that have been used recently. 513 01:02:42,069 --> 01:02:48,004 We had AlexNet starting off first, VGG and GoogLeNet are still very popular, 514 01:02:48,004 --> 01:02:58,218 but ResNet is the most recent best performing model, so if you're looking at training a new network, ResNet is available and you should try working with it. 
515 01:03:00,154 --> 01:03:06,403 So just quickly looking at some of this to get a better sense of the complexity involved. 516 01:03:06,403 --> 01:03:14,120 So here we have some plots that are sorted by performance, so this is top one accuracy here, and higher is better. 517 01:03:15,275 --> 01:03:21,540 And so you'll see a lot of these models that we talked about, as well as some different versions of them, so this GoogLeNet Inception thing, 518 01:03:21,540 --> 01:03:31,389 I think there's like V2, V3, and the best one here is V4, which is actually a ResNet plus Inception combination, so these are just kind of 519 01:03:31,389 --> 01:03:39,159 more incremental, smaller changes that they've built on top of them, and so that's the best performing model here. 520 01:03:39,159 --> 01:03:45,446 And if we look on the right, these are plots of their computational complexity. 521 01:03:47,686 --> 01:03:52,313 The Y axis is your top one accuracy, so higher is better. 522 01:03:52,313 --> 01:04:03,074 The X axis is your operations, and so the more to the right, the more ops you're doing, the more computationally expensive, and then the bigger the circle, the more memory usage, 523 01:04:03,074 --> 01:04:07,251 so the gray circles are for reference here, but the bigger the circle the more memory usage, 524 01:04:07,251 --> 01:04:16,206 and so here we can see that VGG, these green ones, are kind of the least efficient. They have the biggest memory, the most operations, 525 01:04:16,206 --> 01:04:18,623 but they do pretty well. 526 01:04:19,838 --> 01:04:29,275 GoogLeNet is the most efficient here. It's way down on the operation side, as well as a small little circle for memory usage. 527 01:04:29,275 --> 01:04:39,411 AlexNet, our earlier model, has the lowest accuracy. It's relatively smaller compute, because it's a smaller network, but it's also not particularly memory efficient. 
528 01:04:41,309 --> 01:04:46,216 And then ResNet here, we have moderate efficiency. 529 01:04:46,216 --> 01:04:52,500 It's kind of in the middle, both in terms of memory and operations, and it has the highest accuracy. 530 01:04:56,029 --> 01:04:58,028 And so here also are some additional plots. 531 01:04:58,028 --> 01:05:14,868 You can look at these more on your own time, but this plot on the left is showing the forward pass time, and so this is in milliseconds, and you can see up at the top the VGG forward pass is about 200 milliseconds, so you can get about five frames per second with this, and this is sorted in order. 532 01:05:14,868 --> 01:05:25,883 There's also this plot on the right looking at power consumption, and if you look more at this paper here, there's further analysis of these kinds of computational comparisons. 533 01:05:30,604 --> 01:05:38,750 So these were the main architectures that you should really know in depth and be familiar with, and be thinking about actively using. 534 01:05:38,750 --> 01:05:48,263 But now I'm going to go briefly through some other architectures that are just good to know, either historical inspirations or more recent areas of research. 535 01:05:50,716 --> 01:05:56,342 So the first one, Network in Network, this is from 2014, and the idea behind this 536 01:06:00,529 --> 01:06:16,118 is that we have these vanilla convolutional layers, but this introduces the idea of what they call MLP conv layers, which are micro networks, basically a network within a network, hence the name of the paper, 537 01:06:16,118 --> 01:06:23,152 where within each conv layer they're trying to stack an MLP with a couple of fully connected layers on top of 538 01:06:23,152 --> 01:06:29,167 just the standard conv, and be able to compute more abstract features for these local patches, right. 
539 01:06:29,167 --> 01:06:41,975 So instead of sliding just a conv filter around, it's sliding a slightly more complex hierarchical set of filters around, and using that to get the activation maps. 540 01:06:41,975 --> 01:06:47,941 And so it uses these fully connected, or basically one by one conv, kinds of layers. 541 01:06:47,941 --> 01:06:57,196 It's going to stack them all up like the bottom diagram here, where we just have these networks within networks stacked in each of the layers. 542 01:06:57,196 --> 01:07:10,102 And the main reason to know this is just that it was kind of a precursor to GoogLeNet and ResNet, back in 2014, with this idea of bottleneck layers that you saw used very heavily in there. 543 01:07:10,102 --> 01:07:22,070 And it also had a little bit of philosophical inspiration for GoogLeNet, with this idea of a local network topology, a network within a network, that they also used, with a different kind of structure. 544 01:07:24,238 --> 01:07:36,759 Now I'm going to talk about a series of works since ResNet that are mostly geared towards improving ResNet, and so this is more recent research that has been done since then. 545 01:07:36,759 --> 01:07:39,911 I'm going to go over these pretty fast, and so just at a very high level. 546 01:07:39,911 --> 01:07:44,754 If you're interested in any of these you should look at the papers for more details. 547 01:07:45,755 --> 01:07:55,719 So the authors of ResNet, a little bit later on in 2016, also had this paper where they improved the ResNet block design. 548 01:07:56,742 --> 01:08:03,015 And so they basically adjusted which layers were in the ResNet block path, 549 01:08:03,015 --> 01:08:18,861 and showed this new structure was able to have a more direct path for propagating information throughout the network, and you want to have a good path to propagate information all the way up, and then back all the way down again. 
550 01:08:18,861 --> 01:08:25,319 And so they showed that this new block was better for that, and was able to give better performance. 551 01:08:25,319 --> 01:08:28,959 There's also Wide networks, where this paper 552 01:08:28,959 --> 01:08:40,228 argued that while ResNets made networks much deeper as well as added these residual connections, their argument was that the residuals are really the important factor. 553 01:08:40,228 --> 01:08:45,290 Having this residual construction, and not necessarily having extremely deep networks. 554 01:08:45,290 --> 01:08:52,794 And so what they did was they used wider residual blocks, and so what this means is just more filters in every conv layer. 555 01:08:52,794 --> 01:09:02,661 So before we might have F filters per layer, and they use these factors of K and said well, every layer is going to have F times K filters instead. 556 01:09:02,663 --> 01:09:11,502 And so, using these wider layers, they showed that their 50 layer wide ResNet was able to out-perform the 152 layer original ResNet, 557 01:09:13,754 --> 01:09:23,035 and it also had the additional advantage that, even with the same amount of parameters, it's more computationally efficient 558 01:09:23,035 --> 01:09:26,922 because you can parallelize these wide operations more easily. 559 01:09:26,923 --> 01:09:39,546 Right, convolutions with more neurons are just spread across more kernels, as opposed to depth, which is more sequential, so it's more computationally efficient to increase your width. 560 01:09:39,546 --> 01:09:49,817 So here you can see this work is starting to try to understand the contributions of width and depth and residual connections, and making some arguments for one way versus the other. 
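The F times K widening just described has a simple cost consequence that's worth seeing in numbers: a 3x3 conv mapping F channels to F channels has 9 * F * F weights, so widening both sides by K multiplies the weight count by K squared. The F = 64, K = 2 below are illustrative values, not the paper's settings.

```python
# Sketch of the widening factor K from the wide-network discussion: a 3x3
# conv layer mapping F channels to F channels has 9 * F * F weights, so
# widening both sides to K*F multiplies the weight count by K squared.
# F = 64 and K = 2 are illustrative values, not the paper's settings.
def conv3x3_params(channels_in, channels_out):
    """Weight count of a 3x3 conv layer (biases ignored)."""
    return 3 * 3 * channels_in * channels_out

F, K = 64, 2
base = conv3x3_params(F, F)
wide = conv3x3_params(K * F, K * F)
print(base, wide, wide // base)  # 36864 147456 4 -- widening by K=2 costs 4x
```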
561 01:09:49,817 --> 01:09:58,125 And this other paper around the same time, I think maybe a little bit later, is ResNeXt, 562 01:09:58,125 --> 01:10:04,383 and so this is again the creators of ResNet continuing to work on pushing the architecture. 563 01:10:04,383 --> 01:10:18,576 And here they also had this idea of okay, let's indeed tackle this width thing more, but instead of just increasing the width of this residual block through more filters, they have structure. 564 01:10:18,576 --> 01:10:26,415 And so within each residual block there are multiple parallel pathways, and they're going to call the total number of these pathways the cardinality. 565 01:10:26,415 --> 01:10:36,317 And so it's basically taking the one ResNet block with the bottlenecks, having it be relatively thinner, but having multiple of these done in parallel. 566 01:10:38,395 --> 01:10:44,452 And so here you can also see that this has some relation to this idea of wide networks, 567 01:10:44,452 --> 01:10:54,023 as well as some connection to the Inception module, right, where we have these layers operating in parallel. 568 01:10:54,023 --> 01:10:58,190 And so now this ResNeXt has some flavor of that as well. 569 01:11:00,838 --> 01:11:13,878 So another approach towards improving ResNets was this idea called Stochastic Depth, and in this work the motivation is, well, let's look more at this depth problem. 570 01:11:13,878 --> 01:11:21,537 Once you get deeper and deeper, the typical problem that you're going to have is vanishing gradients, right. 571 01:11:21,537 --> 01:11:32,071 Your gradients will get smaller and eventually vanish as you're trying to back propagate them over a very large number of layers. 572 01:11:32,071 --> 01:11:43,045 And so their motivation is, well, let's try to have short networks during training, and they use this idea of dropping out a subset of the layers during training. 
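The layer-dropping idea can be sketched in plain Python: during training each residual block is skipped (acting as the identity) with some probability, so gradients flow through a shorter effective network, and at test time every block is kept. The constant drop probability and the toy one-element "blocks" here are simplifying assumptions; the paper actually varies the drop probability with depth.

```python
import random

# Toy sketch of stochastic depth: during training each residual block is
# skipped (acting as the identity) with some probability, so the effective
# network is shorter; at test time all blocks are kept. A constant drop
# probability is a simplification of the paper's depth-dependent schedule.
def forward(x, blocks, drop_prob, training, rng):
    for block in blocks:
        if training and rng.random() < drop_prob:
            continue  # dropped block contributes nothing: output = input
        x = [xi + fi for xi, fi in zip(x, block(x))]  # residual update
    return x

add_one = lambda x: [1.0 for _ in x]  # toy residual transform, F(x) = 1
blocks = [add_one] * 4

print(forward([0.0], blocks, drop_prob=0.5, training=True, rng=random.Random(0)))
print(forward([0.0], blocks, drop_prob=0.5, training=False, rng=random.Random(0)))  # [4.0]
```

With all four blocks kept at test time the output is 4.0; in training mode some blocks are randomly skipped, so the output is smaller on average.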
573 01:11:43,045 --> 01:11:48,436 And so for a subset of the layers they just drop out the weights and set the layer to an identity connection, 574 01:11:48,436 --> 01:11:56,126 and now what you get is these shorter networks during training, so you can pass back your gradients better. 575 01:11:56,126 --> 01:12:04,074 It's also a little more efficient, and then it's kind of like dropout, right. It has this sort of flavor that you've seen before. 576 01:12:04,074 --> 01:12:08,108 And then at test time you want to use the full deep network that you've trained. 577 01:12:10,446 --> 01:12:19,038 So these are some of the works that are looking at the ResNet architecture, trying to understand different aspects of it and trying to improve ResNet training. 578 01:12:19,038 --> 01:12:32,253 And so there are also some works now that are going beyond ResNet, asking well, what are some non-ResNet architectures that can maybe work comparably or better than ResNets. 579 01:12:32,253 --> 01:12:45,273 And so one idea is FractalNet, which came out pretty recently, and the argument in FractalNet is that maybe residual representations are not actually necessary, so this goes back to what we were talking about earlier. 580 01:12:45,273 --> 01:12:52,645 What's the motivation of residual networks? It seems to make sense, and there are, you know, good reasons for why this should help, but in this paper 581 01:12:52,645 --> 01:12:58,407 they're saying that well, here is a different architecture that we're introducing, with no residual representations. 582 01:12:58,407 --> 01:13:03,898 We think that the key is more about transitioning effectively from shallow to deep networks, 583 01:13:03,898 --> 01:13:13,258 and so they have this fractal architecture, which, if you look on the right here, has these layers that they compose in this fractal fashion. 584 01:13:14,769 --> 01:13:18,639 And so there are both shallow and deep pathways to your output. 
585 01:13:20,045 --> 01:13:29,568 And so they have these different length pathways, they train them with dropping out sub-paths, and so again it has this dropout kind of flavor, 586 01:13:29,568 --> 01:13:37,203 and then at test time they'll use the entire fractal network, and they show that this was able to get very good performance. 587 01:13:39,047 --> 01:13:44,886 There's another idea called Densely Connected Convolutional Networks, DenseNet, and the idea here 588 01:13:44,886 --> 01:13:48,567 is that now we have these blocks that are called dense blocks. 589 01:13:48,567 --> 01:13:55,940 And within each block, each layer is going to be connected to every other layer after it, in this feed forward fashion. 590 01:13:55,940 --> 01:14:00,362 So within this block, your input to the block is also the input to every other conv layer, 591 01:14:00,362 --> 01:14:08,779 and as you compute each conv output, those outputs are now connected to every layer after, and then these are all concatenated 592 01:14:08,779 --> 01:14:18,643 as input to the next conv layer, and they have some other processes for reducing the dimensions and keeping this efficient. 593 01:14:18,643 --> 01:14:30,863 And so their main takeaway from this is that they argue that this is alleviating the vanishing gradient problem, because you have all of these very dense connections. 594 01:14:30,863 --> 01:14:37,324 It strengthens feature propagation, and then also encourages feature reuse, right, because there are so many of these 595 01:14:37,324 --> 01:14:45,487 connections, each feature map that you're learning is input to multiple later layers and is being used multiple times. 596 01:14:47,906 --> 01:15:03,006 So these are just a couple of ideas that are, you know, alternatives, what can we do that's not ResNets and yet still performs comparably or better than ResNets, and so this is another very active area of current research. 
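The concatenation pattern in a dense block has a simple arithmetic consequence that can be sketched in a few lines: layer l receives the block input plus all l earlier outputs concatenated, so with growth rate g its input has c0 + l * g channels. The numbers below (64 input channels, growth rate 32) are illustrative, not a specific configuration from the paper.

```python
# Sketch of channel growth inside a DenseNet dense block: layer l receives
# the concatenation of the block input and all l earlier layers' outputs,
# so with growth rate g its input has c0 + l * g channels. The values used
# here (c0=64, g=32) are illustrative, not a configuration from the paper.
def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input channel count seen by each of the num_layers conv layers."""
    return [c0 + l * growth_rate for l in range(num_layers)]

print(dense_block_input_channels(64, 32, 4))  # [64, 96, 128, 160]
```

This linear growth in channel count is exactly why the dimension-reducing transition steps mentioned above are needed between blocks.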
597 01:15:03,006 --> 01:15:11,830 You can see that a lot of this is looking at how different layers are connected to each other, and how depth is managed in these networks. 598 01:15:13,528 --> 01:15:17,991 And so one last thing that I wanted to mention quickly is just efficient networks. 599 01:15:17,991 --> 01:15:33,994 So this idea of efficiency, and you saw that GoogLeNet was a work that was looking in this direction of how can we have efficient networks, which are important for, you know, a lot of practical usage, both training as well as especially deployment, and so this is 600 01:15:33,994 --> 01:15:37,927 another recent network that's called SqueezeNet, 601 01:15:37,927 --> 01:15:41,618 which is looking at very efficient networks. They have these things called fire modules, 602 01:15:41,618 --> 01:15:49,645 which consist of a squeeze layer with a lot of one by one filters, and then this feeds into an expand layer with one by one and three by three filters, 603 01:15:49,645 --> 01:15:59,220 and they're showing that with this kind of architecture they're able to get AlexNet level accuracy on ImageNet, but with 50 times fewer parameters, 604 01:15:59,220 --> 01:16:06,093 and then you can further do network compression on this to get up to 500 times smaller than AlexNet, 605 01:16:06,093 --> 01:16:10,095 and have the whole network just be 0.5 megabytes. 606 01:16:10,095 --> 01:16:20,062 And so this is a direction of how we have efficient networks and model compression, which we'll cover more in a later lecture, but just giving you a hint of that. 607 01:16:21,856 --> 01:16:26,809 OK so today in summary we've talked about different kinds of CNN architectures. 608 01:16:26,809 --> 01:16:31,555 We looked in depth at four of the main architectures that you'll see in wide usage. 609 01:16:31,555 --> 01:16:35,553 AlexNet, one of the early, very popular networks. 610 01:16:35,553 --> 01:16:38,832 VGG and GoogLeNet, which are still widely used. 
611 01:16:38,832 --> 01:16:45,906 But ResNet is kind of taking over as the thing that you should be looking at most when you can. 612 01:16:45,906 --> 01:16:50,587 We also looked at these other networks at a brief, high-level overview. 613 01:16:51,921 --> 01:16:58,228 And so the takeaway is that these models are available, they're in a lot of the frameworks, so you can use them when you need them. 614 01:16:58,228 --> 01:17:06,827 There's a trend toward extremely deep networks, but there's also significant research now around the design of how we connect layers, 615 01:17:06,827 --> 01:17:15,419 skip connections, what is connected to what, and also using these design choices to improve gradient flow. 616 01:17:15,419 --> 01:17:22,748 There's an even more recent trend towards examining the necessity of depth versus width, and of residual connections. 617 01:17:22,748 --> 01:17:31,380 The trade-offs, what's actually helping, and so there are a lot of these recent works in this direction that you can look into, some of the ones I pointed out, if you are interested. 618 01:17:31,380 --> 01:17:33,597 And next time we'll talk about recurrent neural networks.